System Design of a Job Scheduler in Golang
Background:
Container technology has been used heavily in web development in recent years. I am very interested in engineering useful applications that work in Docker and Kubernetes environments. In previous posts, I wrote down my study of the K8S system. In this post, I continue that journey and build tools with Go. This post will be more formal, similar to a system design document.
Functional requirements:
Design a microservices system that periodically crawls job information and then sends out email notifications based on user preferences. It should use Go, Docker, and (potentially) K8S, for study purposes.
- The types of jobs and number of companies to crawl are limited (<10 companies × 3 types), so concurrent processing is sufficient. In the K8S case (distributed service), the workers would be K8S nodes and Pods. If this scaled to hundreds or thousands of simultaneous crawling tasks, a message queue would be needed. (Out of scope)
- Task status update: set up a web server with an endpoint (“tasks/<uuid>”) to update task status; a sketch of this endpoint follows this list. Here is the workload estimate for the web server: concurrent crawlers are capped at 30, and each crawler finds around 20 new jobs per 2-hour window, so about 30 × 20 = 600 requests per window. If each worker completes crawling a job posting in 500 ms on average, each issues roughly 2 status updates per second, so peak load is 2 × 30 = 60 requests per second, which is fine for a simple DB update.
- Suppose the system has thousands or millions of subscribers; then a queue is needed before sending user email notifications.
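
Here is a minimal sketch of the status-update endpoint using Go's standard net/http. The PUT method, the port, and the SQL in the comment are assumptions for illustration, not the final implementation.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// PUT /tasks/<uuid> records a crawling task's new status.
	http.HandleFunc("/tasks/", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		uuid := strings.TrimPrefix(r.URL.Path, "/tasks/")
		if uuid == "" {
			http.Error(w, "missing task uuid", http.StatusBadRequest)
			return
		}
		// A real handler would run a single-row update, e.g.
		//   UPDATE tasks SET status = $1 WHERE uuid = $2
		// which stays cheap at ~60 req/s, especially with an index on uuid.
		log.Printf("task %s updated", uuid)
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```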
Design choices:
- Golang for the job scheduler manager, because it is designed for highly concurrent applications and communicates easily with Kubernetes services.
- Python for the crawler. Python has easy-to-use crawling and HTML-parsing libraries such as Selenium and Beautiful Soup.
- Docker for containerization. Each unit is containerized and serves as a microservice, which can be managed by Kubernetes or docker-compose.
- RabbitMQ for email queuing. I am only handling the case of many subscribed users and a limited number of crawling targets; in this case, a message queue is needed to fan out user notifications.
- Postgres for storing task, job, and user info; possibly add an index on task uuid to speed up the row lookups that task status updates perform.
- MongoDB for storing crawled job information. The URL is the key used to check whether a particular job has already been visited (see the sketch after this list).
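
A minimal sketch of the visited-URL check, assuming the official Go driver (go.mongodb.org/mongo-driver); the connection string and the database/collection names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	cli, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://mongo:27017")) // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Disconnect(ctx)

	jobs := cli.Database("crawler").Collection("jobs") // assumed names
	url := "https://example.com/careers/12345"

	// Count documents matching the URL; > 0 means this posting was seen before.
	n, err := jobs.CountDocuments(ctx, bson.M{"url": url})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("already visited:", n > 0)
}
```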
Overall Design

Detailed Design

- As shown in the graph, there will be two threads (goroutines) to start and monitor the Python crawlers.
- In the Docker environment, the Docker daemon's docker.sock (a Unix domain socket) is mounted into the Go cron container, which communicates with the daemon by reading from and writing to that socket; a sketch using the Docker Go SDK follows this list.
- In the K8S environment, the Go cron service will call the K8S API server to spawn new Pods. (Depending on time, this may not be implemented.)
- For concurrency control, the design is to pass an env variable for the number of concurrent threads and then use a semaphore to cap concurrency; see the semaphore sketch after this list.
- As further discussion, a thread pool could be implemented to dynamically allocate processes and threads depending on the host CPU.
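
For the Docker case, here is a sketch of spawning one crawler container through the mounted socket, using the Docker Engine Go SDK (github.com/docker/docker/client). The image name and env variables are assumptions, and the exact option types vary slightly across SDK versions.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// runCrawler starts one Python crawler container for a company/job-type pair.
func runCrawler(ctx context.Context, company, jobType string) (string, error) {
	// client.FromEnv defaults to /var/run/docker.sock, the socket mounted above.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return "", err
	}
	defer cli.Close()

	resp, err := cli.ContainerCreate(ctx, &container.Config{
		Image: "crawler:latest", // hypothetical crawler image
		Env:   []string{"COMPANY=" + company, "JOB_TYPE=" + jobType},
	}, nil, nil, nil, "")
	if err != nil {
		return "", err
	}
	if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
		return "", err
	}
	return resp.ID, nil
}

func main() {
	id, err := runCrawler(context.Background(), "companyA", "backend")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("started crawler container", id)
}
```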
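
And a minimal sketch of the semaphore-based concurrency cap, using a buffered channel as a counting semaphore; the env variable name MAX_CRAWLERS and the task list are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

func main() {
	// Read the concurrency limit from an env variable (assumed name: MAX_CRAWLERS).
	limit, err := strconv.Atoi(os.Getenv("MAX_CRAWLERS"))
	if err != nil || limit <= 0 {
		limit = 30 // fall back to the 30-crawler cap from the estimate above
	}

	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	var wg sync.WaitGroup

	tasks := []string{"companyA/backend", "companyA/frontend", "companyB/backend"}
	for _, t := range tasks {
		wg.Add(1)
		go func(task string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks once `limit` crawlers run
			defer func() { <-sem }() // release the slot when this crawl finishes
			fmt.Println("crawling", task) // placeholder for spawning the Python crawler
		}(t)
	}
	wg.Wait()
}
```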
UI and Email


Better Design (* may or may not implement)

While in the middle of implementation (nearly finished with my previous design), I realized there is a design with better storage management and more efficient API responses for the UI. Here is a brief overview.
- This is a MapReduce-like infrastructure.
- The system uses a message queue or similar in-memory data store to temporarily hold task info. A reducer service listens to the queue, checks whether all jobs have finished, and then starts email composing, updates the URL store, and cleans the DB storing job details (a rough reducer sketch follows this list). I don't need permanent storage for each task; I just need to maintain logs for each service and worker.
- The URL store holds the latest job URLs for each company and job type pair, and it is updated after each round of crawling. This gives a low-latency experience when users want the latest URLs or the number of job posts per company, because previously the system had to query a large volume of historical data in the Postgres database.
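
A rough sketch of the reducer side, assuming RabbitMQ with the amqp091-go client, a queue named task_results, and one message per finished crawler; the broker address, queue name, and expected-count logic are all assumptions.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672/") // assumed broker address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Consume task-result messages published by the crawlers (assumed queue name).
	msgs, err := ch.Consume("task_results", "", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	const expected = 30 // one result message per crawler in a round (assumption)
	done := 0
	for range msgs {
		done++
		if done == expected {
			log.Println("all crawlers finished; composing emails and updating URL store")
			done = 0 // reset for the next round of crawling
		}
	}
}
```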
* Notes for EC2 Setup
- git
sudo yum update -y
sudo yum install git -y
- docker
- docker-compose
sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose version